Abstract

Due to chip area and pin count constraints, large concentrator
switches sometimes must be partitioned among several chips. This
paper presents designs for two multichip partial concentrator
switches, both of which follow from a lemma showing that an
$\epsilon$-nearsorter is also an $(n, m, 1 - \epsilon/m)$ partial
concentrator.

The first switch, based on the Revsort algorithm, is an
$(n, m, 1 - O(n^{3/4}/m))$ partial concentrator switch with at most
$2\sqrt{n} + \lceil (\lg n)/2 \rceil$ data pins per chip,
$\Theta(\sqrt{n})$ chips, and volume $\Theta(n^{3/2})$. A message
incurs $3 \lg n + O(1)$ gate delays in passing through the switch.

The second switch, based on Columnsort, is an
$(n, m, 1 - O(n^{2-2\beta}/m))$ partial concentrator switch with
$\Theta(n^{\beta})$ data pins per chip, $\Theta(n^{1-\beta})$ chips,
and volume $\Theta(n^{1+\beta})$, for any $1/2 \le \beta \le 1$. A
message incurs $4\beta \lg n + O(1)$ gate delays.

1  Introduction 

The problem of concentrating relatively few signals on many input
lines onto a lesser number of output lines must be solved in many
kinds of communication networks. In many parallel computing systems,
information is packaged into messages which are routed among the
processors. The switches that route these messages sometimes require
more chip area or input and output wires than a single chip can
supply. This paper presents two designs for fast multichip partial
concentrator switches suitable for routing bit-serial messages in a
parallel supercomputer. The key lemma of this paper may be used to
justify other partial concentrator designs.

This research was supported in part by the Defense Advanced Research
Projects Agency under Contract N00014-80-C-0622 and in part by a
National Science Foundation Fellowship.


An $n$-by-$m$ perfect concentrator switch has $n$ input wires
$X_1, X_2, \ldots, X_n$ and $m \le n$ output wires
$Y_1, Y_2, \ldots, Y_m$. The switch can establish $m$ disjoint
electrical paths from any set of $m$ input wires to the $m$ output
wires. A perfect concentrator switch always routes as many messages
as possible. Specifically, whenever $k$ out of the $n$ input wires of
an $n$-by-$m$ perfect concentrator switch carry messages, one of the
following is true:

•  If $k \le m$, then an electrical path is established from each
input wire that contains a message to an output wire.

•  If $k > m$, then each output wire has an electrical path
established from an input wire that contains a message.

When $k > m$, some messages cannot be successfully routed, in which
case we say the switch is congested. Typical ways of handling
unsuccessfully routed messages in a routing network are to buffer
them, to misroute them, or to simply drop them and rely on a
higher-level acknowledgment protocol to detect this situation and
resend them. The switch designs in this paper are compatible with any
of these congestion control methods.

One way to create a perfect concentrator switch is with a
hyperconcentrator switch. An $n$-by-$n$ hyperconcentrator switch has
$n$ input wires $X_1, X_2, \ldots, X_n$ and $n$ output wires
$Y_1, Y_2, \ldots, Y_n$. The switch can establish disjoint electrical
paths from any set of $k$ input wires, for any $1 \le k \le n$, to
the first $k$ output wires $Y_1, Y_2, \ldots, Y_k$. In other words,
we route the $k$ messages to the first $k$ output wires. We can make
any $n$-by-$m$ perfect concentrator switch from an $n$-by-$n$
hyperconcentrator switch by simply choosing the first $m$ output
wires of the hyperconcentrator switch, $Y_1, Y_2, \ldots, Y_m$, as
the $m$ output wires of the perfect concentrator switch.
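
The relationship between the two switch types can be illustrated with
a small behavioral model. The sketch below is only an illustration:
the function names are hypothetical, outputs are numbered from 0
rather than 1, and assigning messages to outputs in input order is an
assumption (the definition only requires that the k messages occupy
the first k outputs).

```python
def hyperconcentrate(valid_bits):
    """Model an n-by-n hyperconcentrator during setup.

    Returns a list `route` where route[i] is the (0-based) output index
    assigned to input i, or None if input i carries an invalid message.
    The k valid messages end up on outputs 0 .. k-1.
    """
    route = []
    next_output = 0
    for bit in valid_bits:
        if bit == 1:
            route.append(next_output)
            next_output += 1
        else:
            route.append(None)
    return route


def perfect_concentrate(valid_bits, m):
    """Model an n-by-m perfect concentrator: keep only the first m outputs."""
    return [out if out is not None and out < m else None
            for out in hyperconcentrate(valid_bits)]


def routed_count(route):
    """Number of messages that reached an output wire."""
    return sum(1 for out in route if out is not None)
```

With k valid messages, `routed_count(perfect_concentrate(bits, m))`
equals min(k, m), which is the behavior required of a perfect
concentrator switch.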


An efficient $n$-by-$n$ hyperconcentrator switch design is given in
[1] and [2]. This switch has a highly regular layout in both ratioed
nMOS and domino CMOS technologies, and a signal incurs exactly
$2 \lg n$ gate delays through the switch.* This switch uses
$\Theta(n^2)$ components and has area $\Theta(n^2)$.

Partitioning this hyperconcentrator switch among multiple chips with
$p$ pins each requires $\Omega((n/p)^2)$ chips, since each $p$-pin
chip has area $O(p^2)$ and there are $\Theta(n^2)$ components to
partition. We may need to partition the switch for two reasons:

1.  The $\Theta(n^2)$ area may exceed the available chip area.

2.  If the switch is to be packaged by itself on a chip, it may
require more input and output pins than are provided by the packaging
technology.


*  We use the notation $\lg n$ to denote $\log_2 n$.

A different hyperconcentrator switch, comprising a parallel prefix
circuit and a butterfly network [1], can be built in volume
$\Theta(n^{3/2})$ with $O(n \lg n)$ chips and as few as four data
pins per chip, but this switch is not combinational. Although its
sequential control is not very complex, it is not as simple as that
of a combinational circuit.

Partial concentrator switches, as we shall see in Sections 4 and 5,
can be combinational with relatively low gate delays. Yet, given
chips with $p$ pins, we can partition $n$-input partial concentrator
switches using only $\Theta(n/p)$ chips. An $(n, m, \alpha)$ partial
concentrator switch has $n$ input wires $X_1, X_2, \ldots, X_n$,
$m \le n$ output wires $Y_1, Y_2, \ldots, Y_m$, and a fraction
$0 < \alpha \le 1$ such that disjoint electrical paths may be
established from any set of $k$ input wires, for any
$1 \le k \le \alpha m$, to $k$ output wires.

A lightly loaded partial concentrator switch is similar to a perfect
concentrator switch. If there are $k$ messages entering an
$(n, m, \alpha)$ partial concentrator switch, one of the following is
true:

•  If $k \le \alpha m$, then an electrical path is established from
each input wire that contains a message to an output wire.

•  If $k > \alpha m$, then at least $\alpha m$ electrical paths are
established from input wires containing messages to output wires.

We call the fraction $\alpha$ the load ratio. If a partial
concentrator switch is lightly loaded, i.e., the number of messages
entering is at most $\alpha m$, then all the messages are routed to
output wires.

An $(n/\alpha, m/\alpha, \alpha)$ partial concentrator switch can be
used anywhere an $n$-by-$m$ perfect concentrator switch is required.
Consider a set of $k \le m$ messages to be routed through an
$n$-by-$m$ perfect concentrator switch. For the
$(n/\alpha, m/\alpha, \alpha)$ partial concentrator switch, we have
that $k \le m = \alpha \cdot (m/\alpha)$, and thus all $k$ messages
are routed to output wires. If there are instead $k > m$ messages to
be routed through the perfect concentrator switch, we have that
$k > m = \alpha \cdot (m/\alpha)$ for the
$(n/\alpha, m/\alpha, \alpha)$ partial concentrator switch, and thus
$m$ output wires carry messages. In either case, the partial
concentrator switch performs the same function as the perfect
concentrator switch, at the cost of a $1/\alpha$-factor increase in
the number of input and output wires.

In this paper, we show a connection between nearsorting and partial
concentration. We then use this relationship to design two efficient
multichip partial concentrator switches, both of which use the
hyperconcentrator switch of [1] and [2] as a subcircuit on a single
chip.

The remainder of this paper is organized as follows. Section 2 covers
some basic terminology and describes the message format upon which
the switches are based. Section 3 defines nearsorting and shows the
relationship between nearsorting and partial concentration. Section 4
presents a design for a partial concentrator switch based on the
Revsort algorithm for sorting on a mesh; Section 5 does the same, but
based on the Columnsort algorithm for sorting on a mesh. Finally,
Section 6 contains further remarks about multichip concentrator
switches.

2  Preliminaries 

In this section, we define some basic terminology and mathematical
conventions and present the message format assumed by the switch
designs.

Bit and boolean values are denoted by "1" and "0" for TRUE and FALSE,
respectively.

We assume that the switches route bit-serial messages. Each message
is formed by a stream of bits arriving at a wire at the rate of one
bit per clock cycle. The first bit of each message that arrives at an
input wire is the valid bit, indicating whether subsequent bits
arriving on that wire form a valid message or an invalid message. The
bit sequence following a valid bit of 1 forms a valid message, which
we would like to be routed from an input wire to an output wire of
the switch. From there it may pass through the remainder of the
routing network. A valid bit of 0 indicates an invalid message, which
does not need to be routed to an output wire.
The valid bits all arrive at the input wires of a switch during the
same clock cycle, which we call setup. An external control line
signals setup. Message bits entering through input wires at cycles
after setup follow the electrical paths in the switch that are
established during setup.

We shall adopt some notational conventions to ease the exposition in
the remainder of this paper. Uppercase symbols denote wire names and
lowercase symbols denote integer values. We shall also use uppercase
symbols to denote bit values on the wires they name when the usage is
unambiguous. Wire names will usually be subscripted.

A sequence of values is sorted if it is in nonincreasing order. The
valid bits output by an $n$-by-$n$ hyperconcentrator switch are thus
sorted, since if there are $k$ valid messages, we have

$$Y_1, Y_2, \ldots, Y_k = 1$$
$$Y_{k+1}, Y_{k+2}, \ldots, Y_n = 0$$

during setup.

Concentrators were originally presented as graphs in, for example,
[4,5,8]. The term "hyperconcentrator" is due to Valiant.
Vertex-disjoint paths from designated input nodes to designated
output nodes are the concentrator graph counterpart of the
combinational routing paths established during setup in the
concentrator switches of this paper.

3  Nearsorting and Partial Concentration

In this section, we define $\epsilon$-nearsorting and show its
relationship to partial concentration. The key lemma proven in this
section is used in the next two sections to justify partial
concentrator switch constructions.

A sequence of values is $\epsilon$-nearsorted if each element in the
sequence is within $\epsilon$ positions of where it belongs in the
fully sorted sequence. For example, the sequence 5, 3, 6, 1, 4, 2 is
2-nearsorted since each element is at most two places away from its
correct position in the fully sorted sequence 6, 5, 4, 3, 2, 1. The
value $\epsilon$ need not be a constant; we will usually let
$\epsilon$ be a function of the size of the sequence. A fully sorted
sequence is also 0-nearsorted.
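
The definition can be made concrete with a short routine that
computes the smallest $\epsilon$ for which a sequence is
$\epsilon$-nearsorted. This is a minimal sketch: the function name is
hypothetical, and when values repeat it assumes that equal elements
keep their relative order in the sorted sequence, a tie-breaking rule
the text does not spell out.

```python
def nearsortedness(seq):
    """Smallest eps such that seq is eps-nearsorted (nonincreasing order).

    order[j] is the original index of the element that belongs in slot j
    of the fully sorted sequence. Python's sort is stable, so equal
    values keep their original relative order.
    """
    order = sorted(range(len(seq)), key=lambda i: -seq[i])
    return max((abs(i - j) for j, i in enumerate(order)), default=0)
```

For the example above, `nearsortedness([5, 3, 6, 1, 4, 2])` evaluates
to 2, and a fully sorted sequence gives 0.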

Since we are only interested in nearsorting valid bits, for the
remainder of this paper we shall be concerned only with inputs whose
value is either 0 or 1. We say that a sequence of values is clean if
they all have the same value; otherwise the sequence is dirty. The
following lemma describes an $\epsilon$-nearsorted sequence of 0's
and 1's.


Figure 1: A fully sorted sequence of $k$ 1's and $n - k$ 0's and an
$\epsilon$-nearsorted sequence of the same values. The
$\epsilon$-nearsorted sequence consists of a clean sequence of at
least $k - \epsilon$ 1's followed by a dirty sequence of at most
$2\epsilon$ bits followed by a clean sequence of at least
$n - k - \epsilon$ 0's.

Lemma 1  A sequence of $n$ bits, containing $k$ 1's and $n - k$ 0's,
is $\epsilon$-nearsorted if and only if it consists of a clean
sequence of at least $k - \epsilon$ 1's followed by a dirty sequence
of at most $2\epsilon$ bits followed by a clean sequence of at least
$n - k - \epsilon$ 0's.

Proof  ($\Rightarrow$) As shown in Figure 1, a fully sorted sequence
of $k$ 1's and $n - k$ 0's is simply $k$ 1's followed by $n - k$ 0's.
In an $\epsilon$-nearsorted sequence of the same values, each 1
appears within the first $k + \epsilon$ positions, and each 0 appears
within the last $n - k + \epsilon$ positions. The only dirty sequence
within the $\epsilon$-nearsorted sequence is therefore centered at
the $k$th position and extends $\epsilon$ positions to either side.
The lemma then follows.

($\Leftarrow$) Again referring to Figure 1, each 1 is within the
first $k + \epsilon$ positions, and each 0 is within the last
$n - k + \epsilon$ positions. The sequence is thus
$\epsilon$-nearsorted.

□
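
Lemma 1 can be sanity-checked by brute force on short bit sequences.
The sketch below is an illustration rather than part of the proof;
the function names are hypothetical, and nearsortedness is measured
under the assumption that equal bits keep their relative order when
the sequence is sorted.

```python
from itertools import product


def nearsortedness(bits):
    """Smallest eps such that bits is eps-nearsorted (1's before 0's)."""
    order = sorted(range(len(bits)), key=lambda i: -bits[i])
    return max((abs(i - j) for j, i in enumerate(order)), default=0)


def has_lemma1_shape(bits, eps):
    """True if bits starts with at least k - eps 1's and ends with at
    least n - k - eps 0's, so the dirty middle has at most 2*eps bits."""
    n, k = len(bits), sum(bits)
    ones_prefix = max(k - eps, 0)
    zeros_suffix = max(n - k - eps, 0)
    return (all(b == 1 for b in bits[:ones_prefix])
            and all(b == 0 for b in bits[n - zeros_suffix:]))


def lemma1_holds_up_to(n_max):
    """Check both directions of Lemma 1 for every bit sequence of
    length 1 .. n_max and every eps in 0 .. n."""
    for n in range(1, n_max + 1):
        for bits in product((0, 1), repeat=n):
            for eps in range(n + 1):
                if (nearsortedness(bits) <= eps) != has_lemma1_shape(bits, eps):
                    return False
    return True
```

Calling `lemma1_holds_up_to(8)` exhaustively compares the two
characterizations for all sequences of up to eight bits.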

The following lemma is the key lemma that relates
$\epsilon$-nearsorting to partial concentration.

Lemma 2  Let $P$ be a switch with $n$ inputs $X_1, X_2, \ldots, X_n$
and $n$ outputs $Y_1, Y_2, \ldots, Y_n$, and suppose that $P$
$\epsilon$-nearsorts valid bits. Then by restricting the outputs of
$P$ to $Y_1, Y_2, \ldots, Y_m$, for any $m \le n$, $P$ is an
$(n, m, \alpha)$ partial concentrator switch, where
$\alpha = 1 - \epsilon/m$.

Proof  Consider any input to switch $P$ containing $k$ 1's and
$n - k$ 0's. We have $\alpha m = (1 - \epsilon/m) m = m - \epsilon$,
and there are two cases.

Case 1: $k \le \alpha m = m - \epsilon$. We have $m \ge k + \epsilon$.
Since $P$ is an $\epsilon$-nearsorter, each 1 appears within the
outputs $\{Y_1, Y_2, \ldots, Y_{k+\epsilon}\} \subseteq
\{Y_1, Y_2, \ldots, Y_m\}$. Thus, each 1 is routed to an output of
the partial concentrator switch.
Figure 2: The output of an $(n, m, 1 - \epsilon/m)$ partial
concentrator switch that is not $\epsilon$-nearsorted. This switch
routes $m - \epsilon$ out of $k > m - \epsilon$ 1's to the first $m$
outputs, but the remaining $k - m + \epsilon$ 1's are routed to the
last $k - m + \epsilon$ out of the $n$ outputs. If we have
$k + \epsilon < n - (k - m + \epsilon)$, or equivalently,
$k + \epsilon < (n + m)/2$, then the last $k - m + \epsilon$ 1's are
not within $\epsilon$ positions of output $Y_k$, and thus the output
sequence is not $\epsilon$-nearsorted.


Case 2: $k > \alpha m = m - \epsilon$. We have $m < k + \epsilon$.
Again, each 1 appears within the outputs
$\{Y_1, Y_2, \ldots, Y_{k+\epsilon}\}$. From Lemma 1, we know that at
most $\epsilon$ of the outputs $\{Y_1, Y_2, \ldots, Y_{k+\epsilon}\}$
carry 0's, so at most $\epsilon$ of the outputs
$\{Y_1, Y_2, \ldots, Y_m\}$ carry 0's. Thus, at least
$m - \epsilon = \alpha m$ of the outputs $\{Y_1, Y_2, \ldots, Y_m\}$
carry 1's.

We conclude that by restricting the outputs of $P$ to
$Y_1, Y_2, \ldots, Y_m$, $P$ is an $(n, m, 1 - \epsilon/m)$ partial
concentrator switch. □

The converse of Lemma 2 is not necessarily true. As shown in
Figure 2, if an $(n, m, 1 - \epsilon/m)$ partial concentrator switch
routes $m - \epsilon$ out of $k > \alpha m = m - \epsilon$ 1's to the
first $m$ outputs, the remaining $k - m + \epsilon$ 1's may be routed
to the last $k - m + \epsilon$ out of the $n$ outputs. In this case,
if there are more than $\epsilon$ outputs between $Y_k$ and
$Y_{n-(k-m+\epsilon)+1}$, the output sequence is not
$\epsilon$-nearsorted.

4  A Revsort-Based Partial Concentrator Switch


In this section, we present a design for an $(n, m, \alpha)$ partial
concentrator switch that uses $\Theta(\sqrt{n})$ chips with only
$\Theta(\sqrt{n})$ data pins each. The basic building block is the
hyperconcentrator switch of [1] and [2] placed on a chip. Each
message incurs $3 \lg n + O(1)$ gate delays in passing through the
switch. The load ratio is $\alpha = 1 - O(n^{3/4}/m)$. Most of the
results of this section originally appeared in [1].

This partial concentrator switch can be implemented in

•  two dimensions with $\Theta(n^2)$ area and one chip type with
$2\sqrt{n}$ data pins, or

•  three dimensions with $\Theta(n^{3/2})$ volume, two chip types
with at most $2\sqrt{n} + \lceil (\lg n)/2 \rceil$ pins, and two
board types.

The design is based on Schnorr and Shamir's Revsort algorithm for
sorting on a mesh [7], which, although not optimal for sorting on a
mesh, is simple. The idea behind the partial concentrator switch is
to nearsort a $\sqrt{n}$-by-$\sqrt{n}$ matrix of valid bits. The $m$
output wires of the switch correspond to the first $m$ nearsorted
matrix entries.

We need some basic definitions. We assume that the rows and columns
of the $\sqrt{n} \times \sqrt{n}$ matrix are numbered
$0, 1, \ldots, \sqrt{n} - 1$ and that $\sqrt{n} = 2^q$ for some
integer $q$. We also define, for any integer $i$,
$0 \le i < \sqrt{n}$, $rev(i)$ to be the binary number obtained by
reversing the bits in the binary representation of $i$, including the
leading zeros. For example, when $\sqrt{n} = 16$, $rev(3)$ is 12.

The partial concentrator switch is built from three stages, each
stage containing $\sqrt{n}$ hyperconcentrator chips. Each
$\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip serves to fully sort
a row or column of valid bits in the underlying matrix. We shall
denote by $H_{l,i}$ the $i$th hyperconcentrator chip in stage $l$,
for $1 \le l \le 3$ and $0 \le i < \sqrt{n}$, with input wires
$X_{l,i,0}, X_{l,i,1}, \ldots, X_{l,i,\sqrt{n}-1}$ and output wires
$Y_{l,i,0}, Y_{l,i,1}, \ldots, Y_{l,i,\sqrt{n}-1}$.

The general idea of the construction of the partial concentrator
switch is as follows. Each stage 1 chip corresponds to a column of
the matrix, so the stage 1 chips fully sort the valid bits in each
column. The input and output wires $X_{1,j,i}$ and $Y_{1,j,i}$
represent the value of the matrix element at row $i$ and column $j$
before and after sorting.

The wiring between stages 1 and 2 is effectively a matrix
transposition, accomplished by connecting the output wire $Y_{1,j,i}$
to the input wire $X_{2,i,j}$ for $0 \le i, j < \sqrt{n}$. Each
stage 2 chip then corresponds to a row of the matrix, so the stage 2
chips fully sort the valid bits in each row. The input and output
wires $X_{2,i,j}$ and $Y_{2,i,j}$ represent the value of the matrix
element at row $i$ and column $j$ before and after sorting.

The wiring between stages 2 and 3 is the composition of two matrix
permutations. We first cyclically rotate row $i$ by $rev(i)$ places
to the right, for $0 \le i < \sqrt{n}$. That is, the matrix element
in row $i$ and column $j$, for $0 \le i, j < \sqrt{n}$, is moved to
row $i$ and column $(rev(i) + j) \bmod \sqrt{n}$. The matrix is then
transposed. Each stage 3 chip then corresponds to a column of the
matrix, so the stage 3 chips fully sort the valid bits in each
column. The two permutations are accomplished in one wiring step by
connecting the output wire $Y_{2,i,j}$ to the input wire
$X_{3,(rev(i)+j) \bmod \sqrt{n},\,i}$ for $0 \le i, j < \sqrt{n}$.
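
The bit-reversal function and the two interstage wiring permutations
can be written out directly. In this sketch the function names are
hypothetical, `side` stands for $\sqrt{n}$ and is assumed to be a
power of two, and wires are identified by (chip index, wire index)
pairs. The stage 3 indexing assumes the same convention as stage 1
(chip index is the column, wire index is the row); that is an
interpretation of the text, not something it states explicitly.

```python
def rev(i, q):
    """Reverse the q-bit binary representation of i, leading zeros included."""
    result = 0
    for _ in range(q):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def stage1_to_stage2(side):
    """Transposition: output wire i of stage 1 chip j (row i, column j)
    feeds input wire j of stage 2 chip i."""
    return {(j, i): (i, j) for j in range(side) for i in range(side)}


def stage2_to_stage3(side):
    """Rotation of row i by rev(i), then transposition: output wire j of
    stage 2 chip i feeds input wire i of stage 3 chip (rev(i) + j) mod side."""
    q = side.bit_length() - 1  # side = 2**q
    return {(i, j): ((rev(i, q) + j) % side, i)
            for i in range(side) for j in range(side)}
```

Both dictionaries are permutations of the `side * side` wires, so
every stage input is driven by exactly one output of the previous
stage.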


The output wires of the partial concentrator switch are the first $m$
output wires of the matrix in row-major order, or $Y_{3,j,i}$ for
$0 \le i < \lfloor m/\sqrt{n} \rfloor$ and $0 \le j < \sqrt{n}$ or
$i = \lfloor m/\sqrt{n} \rfloor$ and $0 \le j < m \bmod \sqrt{n}$.

Like the hyperconcentrator chips from which it is built, the partial
concentrator switch is a combinational circuit. The routing paths are
established by the valid bits during setup, and subsequent bits
follow along these paths.

To see that this construction does indeed yield an
$(n, m, 1 - O(n^{3/4}/m))$ partial concentrator switch, we first
observe that its operation is equivalent to the following algorithm,
which corresponds to the first $1\frac{1}{2}$ iterations of Revsort:

Algorithm 1  Given a $\sqrt{n} \times \sqrt{n}$ matrix with
$\sqrt{n} = 2^q$ and matrix element values of 0 or 1, perform the
following four steps:

1.  Fully sort the columns.

2.  Fully sort the rows.

3.  For $0 \le i < \sqrt{n}$, cyclically rotate row $i$ by $rev(i)$
places to the right, i.e., move the element in column $j$ to column
$(rev(i) + j) \bmod \sqrt{n}$.

4.  Fully sort the columns.
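
Algorithm 1 is easy to simulate on a 0/1 matrix. The following is a
minimal sketch with hypothetical function names; it assumes the
matrix is a square list of lists whose side is a power of two, and it
sorts in nonincreasing order (1's toward the top of a column and the
left of a row), matching the convention of Section 2.

```python
def rev(i, q):
    """Reverse the q-bit binary representation of i."""
    result = 0
    for _ in range(q):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def sort_columns(mat):
    """Fully sort each column: its 1's move to the top."""
    side = len(mat)
    out = [[0] * side for _ in range(side)]
    for j in range(side):
        ones = sum(mat[i][j] for i in range(side))
        for i in range(ones):
            out[i][j] = 1
    return out


def sort_rows(mat):
    """Fully sort each row: its 1's move to the left."""
    return [sorted(row, reverse=True) for row in mat]


def rotate_rows(mat, q):
    """Rotate row i right by rev(i): column j moves to (rev(i) + j) mod side."""
    side = len(mat)
    return [[row[(j - rev(i, q)) % side] for j in range(side)]
            for i, row in enumerate(mat)]


def algorithm1(mat):
    """Steps 1-4 of Algorithm 1; returns a new matrix."""
    q = len(mat).bit_length() - 1  # side = 2**q
    return sort_columns(rotate_rows(sort_rows(sort_columns(mat)), q))


def dirty_rows(mat):
    """Number of rows containing both 0's and 1's."""
    return sum(1 for row in mat if 0 < sum(row) < len(row))
```

Running `dirty_rows(algorithm1(mat))` on sample inputs gives the
number of dirty rows left in the middle of the matrix, which is the
quantity the analysis of the switch depends on.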

The three sorting steps correspond to the three stages of
hyperconcentrator chips in the partial concentrator switch
construction. The wiring between stages 1 and 2 corresponds to
changing from sorting columns to sorting rows. The wiring between
stages 2 and 3 corresponds to the cyclic rotations within rows and
changing from sorting rows to sorting columns. We are now ready to
prove that this construction works.

Theorem 3  The Revsort-based construction yields an
$(n, m, 1 - O(n^{3/4}/m))$ partial concentrator switch.

Proof  Both [1] and [7] show that after running Algorithm 1 on a
$\sqrt{n} \times \sqrt{n}$ matrix with elements valued 0 or 1, the
matrix consists of only clean rows of 1's at the top, clean rows of
0's at the bottom, and at most $2\lceil n^{1/4} \rceil - 1$ dirty
rows in the middle. Since each row contains $\sqrt{n}$ elements,
there are at most $O(n^{3/4})$ dirty bits. By Lemma 1, the sequence
is $O(n^{3/4})$-nearsorted, and by Lemma 2, the circuit is an
$(n, m, 1 - O(n^{3/4}/m))$ partial concentrator switch.

□

Figure 3 shows a two-dimensional layout of the switch using
$3\sqrt{n}$ hyperconcentrator chips, with $2\sqrt{n}$ data pins each.
We simply use crossbar wiring to permute the wires between
hyperconcentrator chips of consecutive stages. The area of this
layout is $\Theta(n^2)$ since the crossbar wiring area is
$\Theta(n^2)$, which dominates the total chip area of
$\Theta(n^{3/2})$. (Each stage of $\sqrt{n}$-by-$\sqrt{n}$
hyperconcentrator chips consists of $\sqrt{n}$ chips, each with area
$\Theta(n)$, for a total chip area of $\Theta(n^{3/2})$.)

Figure 3: A two-dimensional layout of the Revsort-based partial
concentrator switch with $n = 64$ inputs and $m = 28$ outputs. The
electrical paths established by 24 valid messages are shown with
heavy lines. The output wires are the top four output wires of
hyperconcentrator chips $H_{3,0}, H_{3,1}, H_{3,2}, H_{3,3}$ and the
top three output wires of hyperconcentrator chips
$H_{3,4}, H_{3,5}, H_{3,6}, H_{3,7}$.

A signal incurs $2 \lceil \lg \sqrt{n} \rceil + O(1)$ gate delays in
passing through each chip. The $2 \lceil \lg \sqrt{n} \rceil$ gate
delays are from the hyperconcentrator switch within the chip. The I/O
pad circuitry accounts for the additional $O(1)$ delay. The total
number of gate delays incurred by a signal passing through the entire
partial concentrator switch is thus

$$6 \lceil \lg \sqrt{n} \rceil + O(1) \le 6 \lg \sqrt{n} + O(1)
  = 3 \lg n + O(1) .$$

As shown in Figure 4, we can package the partial concentrator switch
in three dimensions using volume $\Theta(n^{3/2})$. Each circuit
board contains one $\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip,
corresponding to one row or column of the matrix. Each of the three
stacks contains $\sqrt{n}$ boards and represents one stage. The wires
cross stack junctions in a $\sqrt{n} \times \sqrt{n}$ array, with the
valid bit value of the wire in row $i$ and column $j$ equal to the
value of the matrix element in the same position at the corresponding
step of Algorithm 1.

Figure 4: The three-dimensional packaging of the Revsort-based
partial concentrator switch for $n = 64$. Each stack contains
$\sqrt{n}$ circuit boards and corresponds to one stage. Each board
contains one $\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip, and
boards in stack 2 follow the hyperconcentrator chip by a
$\sqrt{n}$-bit barrel shifter chip to perform the cyclic rotation of
each row. The $\lg \sqrt{n}$ control bits that determine the shift
amount for each barrel shifter are hardwired.

The matrix transpose between stages 1 and 2 is performed in the
natural way, with the $i$th output wire from board $j$ in stage 1
going straight across the junction to be the $j$th input wire of
board $i$ in stage 2. The wiring permutation between the
hyperconcentrator chips of stages 2 and 3 includes the cyclic
rotations of the rows, followed by the transpose. The transpose is
performed in the natural way once again. We perform the cyclic
rotation by following each stage 2 hyperconcentrator chip by a
$\sqrt{n}$-bit barrel shifter on the same board. The barrel shifter
has $\sqrt{n}$ input wires, $\sqrt{n}$ output wires, and
$\lceil \lg \sqrt{n} \rceil$ control bits which, interpreted as a
binary integer, determine the rotation amount. We hardwire the
control bits in the $i$th board to have the value $rev(i)$.

We use only two board types, $3\sqrt{n}$ hyperconcentrator chips, and
$\sqrt{n}$ barrel shifters in building the switch. All $2\sqrt{n}$
boards in stages 1 and 3 are identical, as are all $\sqrt{n}$ stage 2
boards. The barrel shifters require
$2\sqrt{n} + \lceil \lg \sqrt{n} \rceil
= 2\sqrt{n} + \lceil (\lg n)/2 \rceil$ data pins. The hardwiring of
the barrel shifter control bit values can be performed after the
boards have been fabricated.

To see that the volume is $\Theta(n^{3/2})$, we need only consider
the stage 2 stack, which has the most components. Each board contains
a $\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip and a
$\sqrt{n}$-bit barrel shifter, both having area $\Theta(n)$. The
whole stack of $\sqrt{n}$ boards, and therefore the entire switch,
has volume $\Theta(n^{3/2})$.

Since the barrel shift amounts are hardwired and never change, the
barrel shifters introduce only a constant number of gate delays. A
signal therefore incurs $3 \lg n + O(1)$ gate delays in passing
through the three-dimensional switch.

Letting $p$, the number of pins per chip, be $\Theta(\sqrt{n})$, both
the two-dimensional and three-dimensional layouts use only
$\Theta(n/p)$ chips.


5  A Columnsort-Based Partial Concentrator Switch


In this section, we present a design for an $(n, m, \alpha)$ partial
concentrator switch that uses $\Theta(n^{1-\beta})$ chips with
$\Theta(n^{\beta})$ pins each, where $1/2 \le \beta \le 1$. The basic
building block is a $\Theta(n^{\beta})$-by-$\Theta(n^{\beta})$
hyperconcentrator chip. Each message incurs $4\beta \lg n + O(1)$
gate delays in passing through the switch. The load ratio is
$\alpha = 1 - O(n^{2-2\beta}/m)$. This switch can be implemented in
two dimensions with area $\Theta(n^2)$ or in three dimensions with
volume $\Theta(n^{1+\beta})$. Table 1 shows resource measures for the
Revsort-based switch and the values of $\beta$ at which the switch of
this section matches them asymptotically.

The design is based on Leighton's Columnsort algorithm [3] for
sorting $n$ elements on an $r \times s$ mesh, where $n = rs$ and $s$
evenly divides $r$. The idea behind this partial concentrator switch
is to $(s-1)^2$-nearsort an $r \times s$ matrix of valid bits. As
with the switch of the previous section, the $m$ output wires of the
switch correspond to the first $m$ matrix entries.

               Revsort               Columnsort,           Columnsort,           Columnsort,
                                     $\beta = 1/2$         $\beta = 5/8$         $\beta = 3/4$
pins per chip  $\Theta(n^{1/2})$     $\Theta(n^{1/2})$     $\Theta(n^{5/8})$     $\Theta(n^{3/4})$
chip count     $\Theta(n^{1/2})$     $\Theta(n^{1/2})$     $\Theta(n^{3/8})$     $\Theta(n^{1/4})$
load ratio     $1 - O(n^{3/4}/m)$    $1 - O(n/m)$          $1 - O(n^{3/4}/m)$    $1 - O(n^{1/2}/m)$
gate delays    $3 \lg n + O(1)$      $2 \lg n + O(1)$      $\frac{5}{2} \lg n + O(1)$  $3 \lg n + O(1)$
volume         $\Theta(n^{3/2})$     $\Theta(n^{3/2})$     $\Theta(n^{13/8})$    $\Theta(n^{7/4})$

Table 1: Resource measures for the Revsort-based partial concentrator
switch and the values of $\beta$ at which the Columnsort-based switch
matches them asymptotically.

Figure 5: Row-major and column-major positions of elements in a
$6 \times 3$ matrix.

We may identify a matrix entry by either its row and column position
or by its position in row-major or column-major order. All numbering
starts at 0. Thus, the rows are numbered $0, 1, \ldots, r - 1$ and
the columns are numbered $0, 1, \ldots, s - 1$. The row-major
position of the matrix entry in row $i$ and column $j$ is
$RM(i,j) = si + j$, and its column-major position is
$CM(i,j) = rj + i$. For example, Figure 5 shows the row-major and
column-major positions of a $6 \times 3$ matrix. We have that
$0 \le RM(i,j), CM(i,j) < n$. The row and column position
corresponding to the entry in row-major position $x$ is
$RM^{-1}(x) = (\lfloor x/s \rfloor, x \bmod s)$.
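
The position functions translate directly into code. This is a small
sketch with hypothetical function names; the matrix dimensions `r`
and `s` are passed explicitly, and all indices start at 0 as in the
text.

```python
def rm(i, j, r, s):
    """Row-major position of the entry in row i, column j of an r x s matrix."""
    return s * i + j


def cm(i, j, r, s):
    """Column-major position of the entry in row i, column j."""
    return r * j + i


def rm_inv(x, r, s):
    """Row and column of the entry whose row-major position is x."""
    return (x // s, x % s)
```

For the $6 \times 3$ matrix of Figure 5, the entry in row 1 and
column 2 has `rm(1, 2, 6, 3) == 5` and `cm(1, 2, 6, 3) == 13`.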

The partial concentrator switch is built from two stages, each stage
containing $s$ hyperconcentrator chips. Since the hyperconcentrator
chips are combinational, so is the partial concentrator switch. Each
$r$-by-$r$ hyperconcentrator chip corresponds to a column of the
underlying matrix, fully sorting the column. We shall denote by
$H_{l,j}$ the $j$th hyperconcentrator chip in stage $l$, for
$l = 1, 2$ and $0 \le j < s$, with input wires
$X_{l,j,0}, X_{l,j,1}, \ldots, X_{l,j,r-1}$ and output wires
$Y_{l,j,0}, Y_{l,j,1}, \ldots, Y_{l,j,r-1}$. Wires $X_{l,j,i}$ and
$Y_{l,j,i}$ correspond to the matrix element in row $i$ and
column $j$.

The wiring between stages 1 and 2 corresponds to converting the
matrix from column-major to row-major ordering, using the composition
of functions $RM^{-1} \circ CM$. We connect the output wire
$Y_{1,j,i}$ to the input wire
$X_{2,\,(rj+i) \bmod s,\,\lfloor (rj+i)/s \rfloor}$, for
$0 \le i < r$ and $0 \le j < s$.

Once again, the output wires of the partial concentrator switch are
the first $m$ output wires of the matrix in row-major order. We use
wires $Y_{2,j,i}$ for $0 \le i < \lfloor m/s \rfloor$ and
$0 \le j < s$ or $i = \lfloor m/s \rfloor$ and
$0 \le j < m \bmod s$.

To show that this circuit $(s-1)^2$-nearsorts the valid bits, we
first observe that its operation is equivalent to the following
algorithm, which corresponds to the first three steps of Columnsort:

Algorithm 2  Given an $r \times s$ matrix of $n$ elements, where
$n = rs$, and matrix values of 0 or 1, perform the following three
steps:

1.  Fully sort the columns.

2.  Convert the matrix from column-major to row-major order, i.e.,
move the element in row $i$ and column $j$ to row
$\lfloor (rj + i)/s \rfloor$ and column $(rj + i) \bmod s$.

3.  Fully sort the columns.

The  two  stages  of  hyperconcentrator  chips  correspond
to  steps  1  and  3,  and  the  wiring  between  the  stages
corresponds  to  step  2.  This  correspondence  between
the  circuit  and  Columnsort  allows  us  to  prove  the
following  theorem.

Theorem  4  The  Columnsort-based  construction
yields  an  $(n, m, 1 - (s-1)^2/m)$  partial  concentrator
switch.

Proof  Leighton  shows  in  [3]  that  Algorithm  2  is  an
$(s-1)^2$-nearsorter  when  the  matrix  elements  are  taken
in  row-major  order.  By  Lemma  2,  the  circuit  is  an
$(n, m, 1 - (s-1)^2/m)$  partial  concentrator  switch  when
the  outputs  are  taken  in  row-major  order.  □
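
The behavior claimed by Theorem 4 can be explored empirically. The Python sketch below simulates Algorithm 2 on 0/1 matrices and measures how far the last valid bit lands beyond row-major position $k - 1$, where $k$ is the number of valid bits. It assumes that "fully sort" moves the 1s (valid bits) to the top of each column, as a hyperconcentrator chip would, and that $s$ divides $r$; all function names are illustrative.

```python
import random


def algorithm2(matrix):
    """Apply steps 1-3 to an r x s 0/1 matrix given as a list of rows."""
    r, s = len(matrix), len(matrix[0])

    def sort_columns(mat):
        # Assumption: valid bits (1s) are concentrated at the top.
        out = [[0] * s for _ in range(r)]
        for j in range(s):
            ones = sum(mat[i][j] for i in range(r))
            for i in range(ones):
                out[i][j] = 1
        return out

    step1 = sort_columns(matrix)
    step2 = [[0] * s for _ in range(r)]
    for i in range(r):
        for j in range(s):
            x = r * j + i                       # column-major position
            step2[x // s][x % s] = step1[i][j]  # row-major placement
    return sort_columns(step2)


def last_valid_position(matrix):
    """Row-major index of the last 1, or -1 if the matrix has no 1s."""
    flat = [bit for row in matrix for bit in row]
    return max((x for x, bit in enumerate(flat) if bit), default=-1)


def worst_displacement(r, s, k, trials, seed=0):
    """Largest observed overshoot of the last 1 past position k - 1."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        cells = set(rng.sample(range(r * s), k))
        mat = [[1 if i * s + j in cells else 0 for j in range(s)]
               for i in range(r)]
        worst = max(worst, last_valid_position(algorithm2(mat)) - (k - 1))
    return worst
```

For the $8 \times 4$ matrix of Figure 6, the overshoot returned by `worst_displacement(8, 4, 14, 200)` should never exceed $(s-1)^2 = 9$. Random trials only sample the input space, so this is a sanity check and not a substitute for the proof.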
Figure  6:  A  two-dimensional  layout  of  the
Columnsort-based  partial  concentrator  switch  with  $n = 32$  inputs
and  $m = 18$  outputs.  The  underlying  matrix  is  $8 \times 4$.  The
electrical  paths  established  by  14  valid  messages  are  shown
with  heavy  lines.  The  output  wires  are  the  first  five
output  wires  of  hyperconcentrator  chips  $H_{2,0}$  and  $H_{2,1}$  and
the  first  four  output  wires  of  hyperconcentrator  chips  $H_{2,2}$
and  $H_{2,3}$.


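To achieve the results stated at the beginning of
this section, we let $r = \Theta(n^{\beta})$ and
$s = \Theta(n^{1-\beta})$.  To ensure that $n = rs$ and that
$s$ divides $r$ as $n$ increases, we require that we have
$1/2 \le \beta \le 1$.  The load ratio is then

\[
1 - \frac{\epsilon}{m}
  = 1 - \frac{(s-1)^2}{m}
  = 1 - O\!\left(\frac{n^{2-2\beta}}{m}\right).
\]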


The number of chips is $2s = \Theta(n^{1-\beta})$, and each chip
requires $2r = \Theta(n^{\beta})$ data pins.
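
To make these quantities concrete, the Python sketch below evaluates them for specific values of $r$, $s$, and $m$. It assumes the usual reading of an $(n, m, \alpha)$ partial concentrator, in which up to $\alpha m$ valid messages are guaranteed a path to the outputs; the function name and the dictionary keys are illustrative.

```python
from fractions import Fraction


def switch_parameters(r, s, m):
    """Chip count, pin count, and load ratio of the two-stage switch."""
    if r % s != 0:
        raise ValueError("s must divide r")
    epsilon = (s - 1) ** 2                 # nearsorting bound of Algorithm 2
    return {
        "n": r * s,                        # number of inputs
        "chips": 2 * s,                    # two stages of s chips
        "data_pins_per_chip": 2 * r,       # r inputs and r outputs
        "epsilon": epsilon,
        "load_ratio": Fraction(m - epsilon, m),    # 1 - epsilon/m
        # Assumption: alpha * m messages are guaranteed to be routed.
        "guaranteed_messages": max(m - epsilon, 0),
    }


figure6 = switch_parameters(8, 4, 18)
```

For the switch of Figure 6 this gives 8 chips with 16 data pins each and a load ratio of $1/2$, so 9 valid messages are guaranteed to reach the 18 outputs. Larger loads may still be routed successfully but are not covered by the bound.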

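The delay through the switch is $2 \cdot 2 \lg r + O(1) =
4 \lg r + O(1)$.  Letting $r \le c n^{\beta} + o(n^{\beta})$ for some
constant $c$, we have that the delay is

\[
\begin{aligned}
4 \lg r + O(1) &\le 4 \lg\bigl(c n^{\beta} + o(n^{\beta})\bigr) + O(1) \\
  &\le 4 \lg\bigl((c+1) n^{\beta}\bigr) + O(1)
       \quad \text{(for sufficiently large $n$)} \\
  &= 4\beta \lg n + 4 \lg(c+1) + O(1) \\
  &= 4\beta \lg n + O(1).
\end{aligned}
\]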

A two-dimensional layout using $\Theta(n^2)$ area is shown
in Figure 6.  As in the Revsort-based switch, we use
$n \times n$ crossbar wiring to connect the stages.

Figure 7 shows a three-dimensional packaging of the
switch using volume $\Theta(r^2 s) = \Theta(n^{1+\beta})$.  As in
Figure 6, we have $r = 8$ and $s = 4$.  There are two
stacks of boards, with each stack containing $s$ boards
and corresponding to one stage of hyperconcentrator
chips, and each board containing one r-by-r
hyperconcentrator chip.

Figure  7:  The  three-dimensional  packaging  of  the
Columnsort-based  partial  concentrator  switch  for  $r = 8$
and  $s = 4$.  Each  stack  contains  $s$  chips,  each  of  which
is  an  r-by-r  hyperconcentrator.  The  wiring  between  the
stages  of  chips  performs  the  $RM^{-1} \circ CM$  permutation.
The  interstack  connectors  transpose  the  wires  from
vertical  to  horizontal  alignment.

Figure  8:  The  transposition  of  $w$  wires  from  vertical
to  horizontal  alignment,  shown  for  $w = 4$,  using  volume
$\Theta(w^2)$.

The tricky part of this construction is the wiring
between stages, which must perform the permutation
$RM^{-1} \circ CM$.  On the first stack, we group
together output wires whose column-major numberings
are congruent modulo $s$, or equivalently, those whose
row numbers are congruent modulo $s$.  Each such
group contains $r/s$ wires.  In Figure 7, for example,
since we have $s = 4$, we group together wires $Y_{1,0,0}$
and $Y_{1,0,4}$, $Y_{1,0,1}$ and $Y_{1,0,5}$, $Y_{1,0,2}$ and $Y_{1,0,6}$,
$Y_{1,0,3}$ and $Y_{1,0,7}$, etc.  In order to allow them to enter the
stage 2 chips, these wires are then "transposed" in
small interstack connectors to align them horizontally
instead of vertically.  Figure 8 shows one way to
transpose a group of $r/s$ wires in volume $\Theta((r/s)^2)$.
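
The grouping rule can be checked against the interstage permutation. The Python sketch below groups the output wires of one stage-1 board by row number modulo $s$ and confirms that each group is destined for a single stage-2 chip. It assumes the wiring rule in which output wire $i$ of stage-1 chip $j$ feeds stage-2 chip $(rj + i) \bmod s$; the function names are illustrative.

```python
def stage2_chip(j, i, r, s):
    """Stage-2 chip reached by output wire i of stage-1 chip j."""
    return (r * j + i) % s


def wire_groups(r, s):
    """Group the r output wires of a stage-1 board by row number modulo s."""
    groups = {}
    for i in range(r):
        groups.setdefault(i % s, []).append(i)
    return groups


def groups_share_destination(r, s):
    """True if every group on every board feeds exactly one stage-2 chip."""
    return all(
        len({stage2_chip(j, i, r, s) for i in rows}) == 1
        for j in range(s)
        for rows in wire_groups(r, s).values()
    )
```

With $r = 8$ and $s = 4$, `wire_groups(8, 4)` pairs rows 0 and 4, 1 and 5, 2 and 6, and 3 and 7, matching the example from Figure 7. Because $s$ divides $r$, the destination chip of a group is simply its row number modulo $s$, independent of the board.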

The first stack dominates the volume of this
construction.  We have $s$ boards, and each board contains
a $\Theta(r^2)$-area hyperconcentrator chip and a
$\Theta(r^2)$-area wiring permutation.  The total volume of each
stack is thus $\Theta(r^2 s) = \Theta(n^{1+\beta})$.  There are $s^2$
interstack connectors, each with volume $\Theta((r/s)^2)$, for a
total interstack volume of $\Theta(r^2) = \Theta(n^{2\beta})$.  Since we
have $\beta \le 1$, the total interstack volume is $O(n^{1+\beta})$.
The total volume of the partial concentrator switch is
thus $\Theta(n^{1+\beta})$.

For both the two-dimensional and three-dimensional
layouts, letting $p$, the number of pins per chip, be
$\Theta(r)$, we use only $\Theta(s) = \Theta(n/p)$ chips.  The
three-dimensional layout, however, uses $s^2 = \Theta((n/p)^2)$
interstack connectors, but these connectors contain only
wiring and no active components.

6  Concluding  Remarks 

In  this  section,  we  briefly  discuss  the  characteristics 
of  the  partial  concentrator  switches  we  have  seen  and 
then  discuss  multichip  hyperconcentrator  switches. 
Finally,  we  pose  some  open  questions. 

Both  of  the  partial  concentrator  switches  we  have 
examined  are  efficient  in  that  they  are  relatively  fast 
and  can  be  packaged  with  a  relatively  low  volume. 
They  also  allow  air  to  flow  through  in  all  three
dimensions  and  may  thus  be  air-cooled.

The  $\beta$  parameter  of  the  Columnsort-based  switch
defines  a  tradeoff  continuum  for  the  characteristics  of
the  switch.  As  evidenced  by  Table  1,  as  the  value  of  $\beta$
increases,  so  do  the  number  of  pins  per  chip,  delay,  and
volume,  but  the  load  ratio  improves  and  the  number
of  chips  decreases.
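
The direction of each trend can be seen by evaluating the leading terms for a fixed $n$ and several values of $\beta$. The Python sketch below is not a reproduction of Table 1; it restricts $n$ and $r$ to powers of two so that $s$ divides $r$ exactly, drops all constant factors and lower-order terms, and uses illustrative names throughout.

```python
def tradeoff_row(lg_n, lg_r):
    """Leading-order characteristics for n = 2**lg_n and r = 2**lg_r."""
    lg_s = lg_n - lg_r
    if not 0 <= lg_s <= lg_r:
        raise ValueError("need 1/2 <= beta <= 1 so that s divides r")
    r, s = 2 ** lg_r, 2 ** lg_s
    return {
        "beta": lg_r / lg_n,
        "pins_per_chip": 2 * r,
        "chips": 2 * s,
        "delay": 4 * lg_r,             # leading term 4 lg r = 4 beta lg n
        "stack_volume": r * r * s,     # Theta(r^2 s) with constants dropped
        "epsilon": (s - 1) ** 2,       # smaller epsilon, better load ratio
    }


# n = 4096 inputs with beta = 1/2, 2/3, 5/6, and 1.
rows = [tradeoff_row(12, lg_r) for lg_r in (6, 8, 10, 12)]
```

Reading down `rows`, the pin count, delay, and volume grow with $\beta$, while the chip count and $\epsilon = (s-1)^2$ shrink, so the load ratio $1 - \epsilon/m$ improves for any fixed $m$.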

Rather  than  simulating  just  the  first  steps  of
Revsort  and  Columnsort,  one  could  simulate  the  full
algorithms  to  fully  sort  the  valid  bits  and  thus  build
multichip  hyperconcentrator  switches.  Compared  to
the  partial  concentrator  switches  presented  above,
such  hyperconcentrator  switches  have  increased  delay,
and  a  Revsort-based  hyperconcentrator  switch  has  a
greater  chip  count  and  asymptotic  volume  than  its
partial  concentrator  counterpart.

Schnorr and Shamir show in [7] that if steps 1-3
of Algorithm 1 are repeated $\lceil \lg\lg \sqrt{n} \rceil$ times, the
resulting matrix contains at most eight dirty rows.  We
can then complete the full sorting by running three
iterations of the Shearsort algorithm [6].  An n-by-n
hyperconcentrator switch based on the full Revsort
algorithm consists of $\lceil \lg\lg \sqrt{n} \rceil$ repetitions of stacks 1
and 2 of Figure 4 followed by three pairs of different
stacks that simulate Shearsort.  (Each Shearsort stack
consists of $\sqrt{n}$ boards, each of which contains a
$\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip and fixed permutation
wiring.)  A signal passes through $2 \lg\lg n + 4$
hyperconcentrator chips in such an n-by-n hyperconcentrator
switch, incurring $4 \lg n \lg\lg n + 8 \lg n + O(\lg\lg n)$ gate
delays.  The switch uses a total of $\Theta(\sqrt{n} \lg\lg n)$ chips
in volume $\Theta(n^{3/2} \lg\lg n)$.
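
The count of chips on a signal path follows from the structure just described: two stacks per Revsort repetition plus three pairs of Shearsort stacks. The Python sketch below evaluates that count for $n = 2^{\lg n}$ with $\lg n$ even; the function names are illustrative.

```python
import math


def revsort_repetitions(lg_n):
    """ceil(lg lg sqrt(n)) for n = 2**lg_n, with lg_n even and at least 2."""
    lg_sqrt_n = lg_n // 2
    return math.ceil(math.log2(lg_sqrt_n))


def chips_on_path(lg_n):
    """Hyperconcentrator chips a signal passes through in the full switch."""
    revsort_stacks = 2 * revsort_repetitions(lg_n)   # stacks 1 and 2, repeated
    shearsort_stacks = 2 * 3                         # three pairs of stacks
    return revsort_stacks + shearsort_stacks
```

For $n = 2^{16}$ there are three repetitions and `chips_on_path(16)` is 12, which equals $2 \lg\lg n + 4$. The two expressions agree exactly when $\lg n$ is a power of two; otherwise the ceiling makes the structural count slightly larger.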

Similarly,  by  simulating  all  eight  steps  of
Columnsort,  we  can  build  a  hyperconcentrator  switch  with
the  same  asymptotic  volume  and  chip  count  as  the
partial  concentrator  switch  of  Section  5.  A  signal
passes  through  four  chips  and  incurs  $8\beta \lg n + O(1)$
gate  delays  through  such  an  n-by-n
hyperconcentrator  switch.

Rather  than  wondering  how  fast  a  multichip
hyperconcentrator  switch  we  can  build,  we  might  ask  for
what  functions  $f(p)$  can  we  build  a
$(\Theta(f(p)), m, 1 - o(p/m))$  partial  concentrator  switch,  given  chips  with
$p$  pins  and  using  only  two  stages  of  chips.  The
Columnsort-based  construction,  for  example,  gives  us
$f(p) = p^{2-\epsilon}$  for  any  $0 < \epsilon < 1$.  Can  we  achieve
$f(p) = \Omega(p^2)$?  In  general,  how  large  a  function  $f(p)$
can  we  achieve  with  $t$  stages?

There  may  be  $\epsilon$-nearsorters  based  on  networks  other
than  the  two-dimensional  mesh  to  which  we  can
apply  Lemma  2.  What  types  of  partial  concentrator
switches  can  we  build  by  applying  Lemma  2  to  other
$\epsilon$-nearsorters?